Lab 6 - Logistic Regression¶

Obtain the data using the sklearn load_iris method¶

In [139]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
iris_df = iris.frame

Redefine 2 classes for the target variable¶

In [140]:
y = iris.target_names[iris.target] == 'virginica'
iris_df['target'] = y
#iris_df['target'] = y.astype(int)

Explore the data¶

1 table with descriptive statistics for each of the two classes¶

In [141]:
# target == 'non-virginica'
iris_df[iris_df['target'] == 0].describe()
Out[141]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 100.000000 100.000000 100.000000 100.000000
mean 5.471000 3.099000 2.861000 0.786000
std 0.641698 0.478739 1.449549 0.565153
min 4.300000 2.000000 1.000000 0.100000
25% 5.000000 2.800000 1.500000 0.200000
50% 5.400000 3.050000 2.450000 0.800000
75% 5.900000 3.400000 4.325000 1.300000
max 7.000000 4.400000 5.100000 1.800000
In [142]:
# target == 'virginica'
iris_df[iris_df['target'] == 1].describe()
Out[142]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 50.00000 50.000000 50.000000 50.00000
mean 6.58800 2.974000 5.552000 2.02600
std 0.63588 0.322497 0.551895 0.27465
min 4.90000 2.200000 4.500000 1.40000
25% 6.22500 2.800000 5.100000 1.80000
50% 6.50000 3.000000 5.550000 2.00000
75% 6.90000 3.175000 5.875000 2.30000
max 7.90000 3.800000 6.900000 2.50000

This dataset supports classifying iris species from four flower measurements. The independent variables are sepal length, sepal width, petal length, and petal width. The model will be designed to determine whether a flower is of the virginica species based on these variables: a target value of False (0) means non-virginica, and True (1) means virginica. The tables above summarize the features of the flowers for each target class.
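As a quick sanity check on the relabelled target, the two classes can be counted directly. A minimal sketch that rebuilds the DataFrame so it runs standalone:

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
# binary target: True for virginica, False for the other two species
df['target'] = iris.target_names[iris.target] == 'virginica'

# the counts should match the describe() tables: 100 non-virginica, 50 virginica
print(df['target'].value_counts())
```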

1 Histogram per feature, for each of the two classes¶

In [143]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 2, figsize=(7, 7))

sns.histplot(data=iris_df, x="sepal length (cm)", hue="target", multiple="dodge", ax = axs[0, 0])
sns.histplot(data=iris_df, x="sepal width (cm)", hue="target", multiple="dodge", ax = axs[0, 1])
sns.histplot(data=iris_df, x="petal length (cm)", hue="target", multiple="dodge", ax = axs[1, 0])
sns.histplot(data=iris_df, x="petal width (cm)", hue="target", multiple="dodge", ax = axs[1, 1])

plt.show()

The histograms show the frequency distribution of each independent variable for the two target classes. Non-virginica flowers are concentrated at shorter sepal lengths, petal lengths, and petal widths, while virginica tends toward longer measurements. The distribution of sepal width, however, is similar for both classes, suggesting that it may have little value for distinguishing them.

Correlation matrix between the four features¶

In [144]:
f, ax = plt.subplots(figsize=(6, 4))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(iris_df.corr(), annot=True, cmap=cmap)
Out[144]:
<Axes: >

The correlation matrix shows how strongly each feature correlates with the target. Ranked by absolute correlation with the target, the features are petal width (0.77), petal length (0.72), sepal length (0.64), and sepal width (-0.14).
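The ranking quoted above can be reproduced directly by sorting the target column of the correlation matrix. A minimal standalone sketch (the target is cast to int so it participates in `corr()`):

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
df['target'] = (iris.target_names[iris.target] == 'virginica').astype(int)

# correlation of each feature with the binary target, strongest first
corr = df.corr()['target'].drop('target').sort_values(key=abs, ascending=False)
print(corr.round(2))
```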

3 additional graphs¶

Pairplot
Reference: https://www.kaggle.com/code/benhamner/python-data-visualizations?scriptVersionId=1465139&cellId=12

In [145]:
sns.pairplot(iris_df, hue="target", height=3, diag_kind="kde")
Out[145]:
<seaborn.axisgrid.PairGrid at 0x224137917d0>

This pairplot shows the relationships between each independent variable and the distribution of the target values based on those variables. Since we need to create a model that distinguishes between virginica and non-virginica, we must select variables for the model design where the target values are clearly distinguishable. Overall, it can be observed that plots involving sepal width values do not clearly separate the target. In contrast, plots for the other variables generally allow for a clear distinction between the target values.

Boxplot
Reference: https://www.kaggle.com/code/benhamner/python-data-visualizations?scriptVersionId=1465139&cellId=13

In [146]:
iris_df.boxplot(by="target", figsize=(12, 6))
Out[146]:
array([[<Axes: title={'center': 'petal length (cm)'}, xlabel='[target]'>,
        <Axes: title={'center': 'petal width (cm)'}, xlabel='[target]'>],
       [<Axes: title={'center': 'sepal length (cm)'}, xlabel='[target]'>,
        <Axes: title={'center': 'sepal width (cm)'}, xlabel='[target]'>]],
      dtype=object)

This boxplot represents the distribution of variable values according to each target. It can be seen that for virginica, the distribution of petal length is primarily towards longer lengths, while for non-virginica, it is the opposite. For petal width and sepal length, it can also be observed that values tend to be higher for cases classified as virginica, though the difference is less pronounced compared to petal length. In the case of sepal width, the distribution is almost similar for both targets. Therefore, it can be predicted that creating a model using petal length as an independent variable would be advisable.

Radviz
Reference: https://www.kaggle.com/code/benhamner/python-data-visualizations?scriptVersionId=1465139&cellId=16

In [147]:
# A final multivariate visualization technique pandas has is radviz
# Which puts each feature as a point on a 2D plane, and then simulates
# having each sample attached to those points through a spring weighted
# by the relative value for that feature
from pandas.plotting import radviz
radviz(iris_df, "target")
Out[147]:
<Axes: >

RadViz enables the projection of an N-dimensional dataset into a 2D space, where the impact of each dimension is seen as a balance among the influences of all dimensions. Data that show high correlation are positioned nearer to each other on the unit circle. Most virginica and non-virginica cases cluster together in the middle, but it can be observed that certain non-virginica cases are significantly influenced by sepal width. Therefore, it can be concluded that using sepal width as an indicator to represent the general characteristics of all non-virginica cases is challenging.

Split the data to a train set, a validation set and a test set¶

In [148]:
from sklearn.model_selection import train_test_split

# note: no random_state is set, so the exact split (and the numbers below)
# will differ between runs
X_train, X_test, y_train, y_test = train_test_split(iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']], 
                                                    y, test_size = 0.2)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size = 0.5)

Run four logistic regression models, with 1, 2, 3 and 4 features¶

In [149]:
from sklearn.linear_model import LogisticRegression

log_reg_1 = LogisticRegression()
log_reg_2 = LogisticRegression()
log_reg_3 = LogisticRegression()
log_reg_4 = LogisticRegression()


log_reg_1.fit(X_train[['petal width (cm)']], y_train)
log_reg_2.fit(X_train[['petal width (cm)', 'petal length (cm)']], y_train)
log_reg_3.fit(X_train[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']], y_train)
log_reg_4.fit(X_train[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']], y_train)
Out[149]:
LogisticRegression()

Based on the analysis above, features were added to the models in order of decreasing correlation with the target: petal width, petal length, then sepal length. Sepal width, the feature least useful for separating the classes, was added last.

Evaluate the models on the validation set¶

List in a table how well each model is doing for each of the instances in the validation set¶

To compare the models, each of the 1- to 4-feature models was evaluated on the validation set. The metric used is the score() method of LogisticRegression, which returns the mean accuracy: the fraction of predictions that match the actual labels.
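The claim about `score()` can be verified directly: it returns the same value as manually computing the fraction of matching predictions. A minimal sketch, fitting on the full dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris(as_frame=True)
X = iris.frame[['petal width (cm)']]
y = iris.target_names[iris.target] == 'virginica'

model = LogisticRegression().fit(X, y)

# score() is mean accuracy: the fraction of predictions equal to the labels
manual = (model.predict(X) == y).mean()
print(manual, model.score(X, y))  # the two values are identical
```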

1-feature model

In [150]:
# predict class
pred = log_reg_1.predict(X_val[['petal width (cm)']])

# probability estimates
prob = log_reg_1.predict_proba(X_val[['petal width (cm)']])

# mean accuracy score
print("Mean accuracy of the model: ", log_reg_1.score(X_val[['petal width (cm)']], y_val))

# print table
pd.DataFrame({'probability of predicting virginica': prob[:, 1].round(3), 
              'actual prediction': pred, 'ground truth': y_val})
Mean accuracy of the model:  1.0
Out[150]:
probability of predicting virginica actual prediction ground truth
0 0.003 False False
1 0.006 False False
2 0.334 False False
3 0.334 False False
4 0.846 True True
5 0.624 True True
6 0.624 True True
7 0.948 True True
8 0.786 True True
9 0.786 True True
10 0.334 False False
11 0.003 False False
12 0.004 False False
13 0.003 False False
14 0.924 True True

2-feature model

In [151]:
# predict class
pred = log_reg_2.predict(X_val[['petal width (cm)', 'petal length (cm)']])

# probability estimates
prob = log_reg_2.predict_proba(X_val[['petal width (cm)', 'petal length (cm)']])

# mean accuracy score
print("Mean accuracy of the model: ", log_reg_2.score(X_val[['petal width (cm)', 'petal length (cm)']], y_val))

# print table
pd.DataFrame({'probability of predicting virginica': prob[:, 1].round(3), 
              'actual prediction': pred, 'ground truth': y_val})
Mean accuracy of the model:  0.9333333333333333
Out[151]:
probability of predicting virginica actual prediction ground truth
0 0.000 False False
1 0.000 False False
2 0.104 False False
3 0.198 False False
4 0.968 True True
5 0.499 False True
6 0.679 True True
7 0.964 True True
8 0.994 True True
9 0.764 True True
10 0.241 False False
11 0.000 False False
12 0.000 False False
13 0.000 False False
14 0.979 True True

3-feature model

In [152]:
# predict class
pred = log_reg_3.predict(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']])

# probability estimates
prob = log_reg_3.predict_proba(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']])

# mean accuracy score
print("Mean accuracy of the model: ", log_reg_3.score(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']], y_val))

# print table
pd.DataFrame({'probability of predicting virginica': prob[:, 1].round(3), 
              'actual prediction': pred, 'ground truth': y_val})
Mean accuracy of the model:  0.9333333333333333
Out[152]:
probability of predicting virginica actual prediction ground truth
0 0.000 False False
1 0.000 False False
2 0.106 False False
3 0.226 False False
4 0.961 True True
5 0.496 False True
6 0.708 True True
7 0.961 True True
8 0.993 True True
9 0.749 True True
10 0.216 False False
11 0.000 False False
12 0.000 False False
13 0.000 False False
14 0.977 True True

4-feature model

In [153]:
# predict class
pred = log_reg_4.predict(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']])

# probability estimates
prob = log_reg_4.predict_proba(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']])

# mean accuracy score
print("Mean accuracy of the model: ", log_reg_4.score(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']], y_val))

# print table
pd.DataFrame({'probability of predicting virginica': prob[:, 1].round(3), 
              'actual prediction': pred, 'ground truth': y_val})
Mean accuracy of the model:  1.0
Out[153]:
probability of predicting virginica actual prediction ground truth
0 0.000 False False
1 0.000 False False
2 0.099 False False
3 0.211 False False
4 0.962 True True
5 0.501 True True
6 0.696 True True
7 0.959 True True
8 0.993 True True
9 0.732 True True
10 0.220 False False
11 0.000 False False
12 0.000 False False
13 0.000 False False
14 0.975 True True

Plot the decision boundary for three models¶

In [154]:
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as po

po.init_notebook_mode()
X_val['target'] = y_val

1-feature

In [155]:
fig = px.scatter(X_val, x='petal width (cm)', y = 'target')

# the boundary lies where w*x + b = 0, i.e. x = -b/w (predicted probability 0.5)
decision_boundary = -log_reg_1.intercept_ / log_reg_1.coef_
fig.add_trace(go.Scatter(x=[decision_boundary[0][0], decision_boundary[0][0]], y=[False, True], mode="lines", name="Decision Boundary"))

fig.show()

2-feature

In [156]:
import numpy as np

fig = px.scatter(X_val, x='petal width (cm)', y='petal length (cm)', color='target')

decision_boundary_x1 = np.linspace(iris_df[['petal width (cm)']].min(), iris_df[['petal width (cm)']].max(), 10)
# along the boundary w1*x1 + w2*x2 + b = 0, so x2 = -(b + w1*x1) / w2
decision_boundary_x2 = -log_reg_2.intercept_ / log_reg_2.coef_[0][1] - log_reg_2.coef_[0][0] / log_reg_2.coef_[0][1] * decision_boundary_x1
fig.add_trace(go.Scatter(x=[decision_boundary_x1[0][0], decision_boundary_x1[9][0]], y=[decision_boundary_x2[0][0], decision_boundary_x2[9][0]], 
                         mode="lines", name="Decision Boundary"))

fig.show()

3-feature

In [157]:
fig = px.scatter_3d(X_val, x='petal width (cm)', y='petal length (cm)', z='sepal length (cm)', color='target')

x1, x2 = np.meshgrid(np.linspace(iris_df[['petal width (cm)']].min(), iris_df[['petal width (cm)']].max(), 10), 
                     np.linspace(iris_df[['petal length (cm)']].min(), iris_df[['petal length (cm)']].max(), 10))

# the boundary plane satisfies w1*x1 + w2*x2 + w3*x3 + b = 0, solved here for x3
decision_boundary_x3 = -log_reg_3.intercept_ / log_reg_3.coef_[0][2] - log_reg_3.coef_[0][0] / log_reg_3.coef_[0][2] * x1 - log_reg_3.coef_[0][1] / log_reg_3.coef_[0][2] * x2

fig.add_trace(go.Surface(x=x1, y=x2, z=decision_boundary_x3, showscale=False, showlegend=True, opacity=0.8, name="Decision Boundary"))

fig.show()

Failure modes¶

When evaluated on the validation set, the 1- and 4-feature models classified every instance correctly, while the 2- and 3-feature models each misclassified one of the 15 instances. In both cases the error occurred on the same instance, whose predicted probability of virginica was 0.499 (2-feature model) and 0.496 (3-feature model). These probabilities are almost exactly 0.5, which means the point lies essentially on the decision boundary: its true label is virginica (True), but both models predicted False.
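The link between a predicted probability near 0.5 and proximity to the decision boundary follows from the logistic (sigmoid) function: the boundary is exactly the set of points where the linear score w·x + b is zero. A small self-contained check:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps the linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# at the decision boundary the linear score is zero,
# so the predicted probability is exactly 0.5
print(sigmoid(0.0))  # 0.5

# probabilities such as 0.499 or 0.496 correspond to scores just below zero,
# i.e. points sitting marginally on the "non-virginica" side of the boundary
print(round(sigmoid(-0.004), 3))  # 0.499
```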

In [160]:
print(X_val.iloc[6])
sepal length (cm)     5.9
sepal width (cm)      3.0
petal length (cm)     5.1
petal width (cm)      1.8
target               True
Name: 149, dtype: object

In the 2-feature model, this flower's petal length (5.1 cm) and petal width (1.8 cm) fall in the region where the two classes overlap in the histograms, so a misclassification is plausible. In the 3-feature model, adding sepal length did not make the flower any easier to classify. This suggests that adding more features does not always improve a model's predictions.
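The pattern above, accuracy not improving monotonically as features are added, can be checked systematically by fitting models on nested feature subsets. A minimal sketch using the same feature ordering as the lab; the `random_state=0` is an assumption added here for repeatability, so the exact numbers will differ from the tables above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
features = ['petal width (cm)', 'petal length (cm)',
            'sepal length (cm)', 'sepal width (cm)']
X = iris.frame[features]
y = iris.target_names[iris.target] == 'virginica'

# fixed random_state so the comparison is repeatable
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# fit one model per nested feature subset and score it on the held-out set
for k in range(1, 5):
    cols = features[:k]
    acc = LogisticRegression().fit(X_train[cols], y_train).score(X_val[cols], y_val)
    print(f"{k} feature(s): validation accuracy = {acc:.3f}")
```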

Recommend the best model (provide reasoning). Summarize the results of this model on the test set¶

In [159]:
# predict class
pred = log_reg_1.predict(X_test[['petal width (cm)']])

# probability estimates
prob = log_reg_1.predict_proba(X_test[['petal width (cm)']])

# mean accuracy score
print("Mean accuracy of the model: ", log_reg_1.score(X_test[['petal width (cm)']], y_test))

# print table
pd.DataFrame({'probability of predicting virginica': prob[:, 1].round(3), 
              'actual prediction': pred, 'ground truth': y_test})
Mean accuracy of the model:  0.9333333333333333
Out[159]:
probability of predicting virginica actual prediction ground truth
0 0.334 False False
1 0.334 False False
2 0.184 False False
3 0.003 False False
4 0.334 False False
5 0.252 False False
6 0.924 True True
7 0.786 True True
8 0.003 False False
9 0.624 True True
10 0.786 True True
11 0.252 False True
12 0.004 False False
13 0.003 False False
14 0.002 False False

On the validation set, the score() function, which measures the accuracy of the model's predictions, gave 1.00 for the 1- and 4-feature models and 0.93 for the others, indicating that the 1- and 4-feature models are the most accurate. However, the 4-feature model adds sepal width, which has little correlation with the target, and a larger model carries more risk of overfitting, so the 1-feature model can be considered the best. That said, the differences in accuracy are small, and the models may simply perform well on this particular split. With only 120 training instances and just 15 instances each for validation and testing, these results are hardly definitive; a much larger dataset would be needed to evaluate reliably which model is superior.

Indeed, when the 1-feature model was evaluated on the test set, it scored 0.93, correctly predicting 14 of the 15 instances.
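Given how noisy 15-instance hold-out sets are, k-fold cross-validation is one way to obtain a steadier accuracy estimate. A hedged sketch using scikit-learn's `cross_val_score` on the 1-feature model (for classifiers with `cv=5`, scikit-learn applies stratified 5-fold splitting by default):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris(as_frame=True)
X = iris.frame[['petal width (cm)']]
y = iris.target_names[iris.target] == 'virginica'

# 5-fold stratified CV: every instance is evaluated exactly once,
# which is far less noisy than a single 15-instance hold-out set
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.round(3), "mean:", scores.mean().round(3))
```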